INTELLIGENT CONSTRUCTION OF APPLICATION-ORIENTED DATABASES
V.V.Kluev, S.Y.Garnaev
Saint Petersburg State University, St. Petersburg Institute of Fine Mechanics and Optics
Abstract. This paper describes a method for the automated construction of subject-oriented databases on the Internet using a robot program (crawler). The operating principles of the crawler are presented, together with the procedure for tuning it to a subject area and its interaction with a human expert who selects relevant documents. Experimental results for a number of subject areas are analyzed.

Nowadays there are many specialized subject servers on the Internet, and tremendous financial, human, and time resources have been spent to build their thematic collections. Traditionally, the following schema is used to construct a specialized server: an expert manually searches for relevant information on the Internet. This process takes a long time. For example, the database for the WWW server 'Physics in the Internet', supported by the Russian Basic Research Foundation (project 96-07-89149, http://www.physics.nw.ru), has taken three years to build and contains 3000 documents.

The traditional schema can be altered if the results of the project 'Open Architecture Server for Information Search and Delivery' (OASIS) are used. The expert responsible for some part of the database forms a collection from documents that already exist in the database. This collection is a core of the most relevant documents on the given topic; all documents in it are in the same language. For a new topic, the expert has to find a core of documents on the Internet in the traditional manner. The Crawler's task is then to find similar documents on the Internet. The documents the Crawler retrieves as relevant are passed to the expert, who makes the final decision on whether they are worth including in the database.
The Crawler is a program that recursively retrieves documents from the Internet and checks each retrieved document for relevance to the database topic. For each document a score is calculated according to the Crawler's filter, which describes the database topic: the filter contains terms with associated weights, plus a threshold. The threshold is a positive number used to decide whether a document is relevant enough to the database. The relevance score is calculated from the document's term-frequency profile, and a document is considered relevant to the database if its score is greater than the threshold. From each retrieved document the Crawler extracts URLs (Uniform Resource Locators) and adds them to a queue of found URLs, which is kept sorted by the score of the document each URL was found in. The Crawler takes URLs from this queue to retrieve further documents.
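As an illustration of this scoring-and-queueing loop, the following is a minimal sketch in Python, assuming a simple linear filter over raw term frequencies; the terms, weights, threshold value, and function names are illustrative and are not taken from the OASIS implementation.

    import heapq

    # Hypothetical topic filter: terms with weights, plus a relevance threshold.
    FILTER = {"retrieval": 2.0, "index": 1.5, "query": 1.2, "ranking": 1.0}
    THRESHOLD = 3.0  # documents scoring above this are considered relevant

    def score(document_text: str) -> float:
        """Score a document by summing filter weights over its term occurrences."""
        return sum(FILTER.get(word, 0.0) for word in document_text.lower().split())

    # Queue of found URLs, ordered by the score of the document they came from.
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    queue: list[tuple[float, str]] = []

    def enqueue_links(source_score: float, urls: list[str]) -> None:
        for url in urls:
            heapq.heappush(queue, (-source_score, url))

    def next_url() -> str:
        return heapq.heappop(queue)[1]

    # Example: links extracted from a relevant document enter the queue.
    doc = "an index for query ranking in information retrieval"
    s = score(doc)  # 1.5 + 1.2 + 1.0 + 2.0 = 5.7, above the threshold
    if s > THRESHOLD:
        enqueue_links(s, ["http://example.org/a", "http://example.org/b"])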
The OASIS Crawler has been implemented as part of the OASIS project [1, 2] and is currently in the alpha stage.
Below we describe the technology used in the experiments.
To test the technology described above, it was necessary to construct test databases consisting of real Web documents with human evaluations of the pages' topics. Since no standard test collection, such as the TREC collections, exists for real Internet data, several categories in the Yahoo! directory tree were selected, and the pages listed in those categories were used to form test databases. The following categories were selected: Benchmarks, Museums, Research Groups, Travelogues, Information Retrieval, Card Games, Programming Languages, Unix-Linux. Documents listed in a given Yahoo! category are assumed to be relevant only to that category's topic.
Table 1 shows the selected values of the parameters mentioned above: the database core contains 100-150 documents (M), the filter is formed from 1000 terms (N), and the threshold is set to provide a recall of 50% (K). The Crawler ran for 48-385 hours (H); Z, apparently the limit on the number of evaluated documents, was set to 100000.
Table 1. Parameter values

Parameter | Value
M | 100-150
N | 1000
K | 50%
H | 48-385
Z | 100000
Table 2 shows the results of the collection-construction experiments. The Crawler ran as a background process on a server at Dublin University (hardware: Pentium II, 330 MHz; software: Linux). The speed of collection gathering varied widely: on average, 4 documents were recommended for every 100 evaluated documents. The number of duplicate documents (i.e., documents gathered from mirrors) also varied from collection to collection.
The documents recommended for the Information Retrieval and Physics collections were manually analyzed in full. For the other collections, a sample of documents from each directory was analyzed, together with all documents that did not belong to any directory.
Manual expert analysis showed that on average approximately 61% of the recommended documents were actually junk, i.e. they did not belong to the given topic. In other words, about four of every ten documents recommended by the Crawler were correct, giving a precision of about 39% (precision is the fraction of the recommended documents that are actually relevant). The best result was shown by the Physics collection, with only about 20% junk.
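For reference, precision here appears to be computed over the unique recommended documents that were manually analyzed. Taking the fully analyzed Physics row of Table 2 (607 recommended documents minus 33 duplicates leaves 574 analyzed):

    precision = relevant documents / analyzed documents = 467 / 574 ≈ 0.81.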
Table 3 shows an interesting result. The Crawler was started twice for each of two collections, Information Retrieval and Card Games: once with a 1000-term filter and once with a 100-term one. The short filter was constructed manually by an expert, who selected 100 significant terms from the collection's dictionary to build the topic filter. Precision increased for both collections.
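The paper does not specify the criterion by which the 100 significant terms were chosen; a minimal sketch, assuming significance is approximated by filter weight (the name shorten_filter and the ranking rule are assumptions, since the actual selection was done manually by an expert), might look like this:

    def shorten_filter(full_filter: dict[str, float], n: int = 100) -> dict[str, float]:
        """Keep only the n highest-weighted terms of a topic filter."""
        top = sorted(full_filter.items(), key=lambda kv: kv[1], reverse=True)[:n]
        return dict(top)

Whatever the exact selection rule, Table 3 quantifies the effect: precision rose from 0.34 to 0.49 for Information Retrieval and from 0.2 to 0.32 for Card Games, while the Information Retrieval run took 47 hours instead of 385.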
Table 2. Construction of collections

Collection | Core | Time, hours | Evaluated docs | Recommended docs | Duplicate docs | Manually analyzed (dir. + file) | Relevant docs | Precision
Travelogues | 149 | 82 | ~8000 | 2883 | 62 | 553 + 338 | 226 (48 + 137) | 0.08
Research Groups | 100 | 88 | ~22500 | 3091 | 60 | 777 + 134 | 811 (372 + 45) | 0.29
Museums | 121 | 150 | ~75000 | 3631 | 444 | 1780 + 392 | 444 (227 + 39) | 0.14
Benchmarks | 110 | 44 | ~16000 | 1233 | 350 | 136 + 33 | 649 (85 + 24) | 0.73
Programming Languages | 121 | 40 | 12500 | 1004 | 22 | 331 + 88 | 445 (202 + 18) | 0.45
Information Retrieval | 110 | 385 | ~91000 | 594 | 14 | 580 | 202 | 0.34
Physics | 50 | 240 | 100000 | 607 | 33 | 574 | 467 | 0.81
Card Games | 110 | 10 | 10000 | 4170 | 114 | 134 + 129 | 798 (550 + 78) | 0.2
Table 3. The effect of reducing the filter length

Collection | Filter length | Time, hours | Evaluated docs | Recommended docs | Relevant docs | Precision
Information Retrieval | 1000 | 385 | 91000 | 594 | 202 | 0.34
Information Retrieval | 100 | 47 | < 30000 | 479 | 229 | 0.49
Card Games | 1000 | 10 | 10000 | 4150 | 798 | 0.2
Card Games | 100 | 10 | 10000 | 675 | 219 | 0.32
The OASIS Crawler can independently construct application-oriented databases of Internet documents on the basis of information-filtering technology and heuristics. Preliminary results have shown that this technology has promising performance characteristics. The hypothesis that reducing the filter length improves precision has to be confirmed in further experimental tests.